Reputation analysis is known as one document analysis technique; it analyzes the reputation of an object based on intention representations within documents. A reputation analysis determines not only the overall quality of the object but also its quality from each viewpoint from which the object is evaluated. Therefore, a conventional reputation analysis requires not only a dictionary of intention representations but also a dictionary of the viewpoints to which the intention representations apply. Since the former dictionary of intention representations does not depend on a particular field, it is versatile and can be used in various fields. By contrast, since the latter dictionary of viewpoints depends strongly on a particular field, it lacks general versatility and needs to be compiled separately for each field.
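The dependence of a conventional reputation analysis on two dictionaries can be illustrated with a minimal sketch. The dictionary contents, the hotel-review field, and the function name `analyze_reputation` are assumptions introduced only for illustration. The simple word matching below is a stand-in and is not the method of any particular known system.

```python
from collections import defaultdict

# Hypothetical field-independent dictionary of intention representations
# (word -> polarity). The entries are illustrative assumptions.
INTENTION_DICT = {"good": 1, "excellent": 1, "bad": -1, "poor": -1}

# Hypothetical field-dependent dictionary of viewpoints (here: hotel reviews).
# A different field would need a different set of entries.
VIEWPOINT_DICT = {"room", "breakfast", "staff"}


def analyze_reputation(sentences):
    """Sum the polarity per viewpoint over the sentences that mention it."""
    scores = defaultdict(int)
    for sentence in sentences:
        words = sentence.lower().replace(".", "").split()
        # Polarity of the sentence from the intention dictionary alone.
        polarity = sum(INTENTION_DICT.get(w, 0) for w in words)
        # Attribute that polarity to every viewpoint word in the sentence.
        for w in words:
            if w in VIEWPOINT_DICT:
                scores[w] += polarity
    return dict(scores)


reviews = [
    "The room was excellent.",
    "The breakfast was poor.",
    "The staff was good.",
    "The room was bad.",
]
result = analyze_reputation(reviews)
```

In this sketch, `INTENTION_DICT` could be reused unchanged for another field, whereas `VIEWPOINT_DICT` would have to be rewritten, which is the burden described above.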
Meanwhile, document clustering is known as a method for classifying a document set. Document clustering can classify the document set according to the contents of the individual documents. Therefore, if the documents can be classified by the viewpoint to which an intention representation applies, the reputation analysis can be performed without using the dictionary of viewpoints.
A technique that uses a thesaurus in document clustering is also known. For example, there is a technique that selects a layer of the thesaurus and classifies and integrates document clusters by using the registered words on that same layer. In this way, the granularity of the classification of the document clusters can be standardized. In addition, the registered word of the thesaurus used for the classification can be assigned to the classified document cluster as its classification label.
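The layer-based integration described above can be sketched as follows. The source does not give the details of the known technique, so the child-to-parent thesaurus representation, the sample words, and the helper names `depth`, `ancestor_on_layer`, and `integrate_clusters` are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical thesaurus stored as child -> parent links; the root has parent None.
THESAURUS_PARENT = {
    "food": None,
    "fruit": "food",
    "vegetable": "food",
    "apple": "fruit",
    "banana": "fruit",
    "carrot": "vegetable",
}


def depth(word):
    """Return the layer of a registered word (the root is layer 0)."""
    d = 0
    while THESAURUS_PARENT[word] is not None:
        word = THESAURUS_PARENT[word]
        d += 1
    return d


def ancestor_on_layer(word, layer):
    """Climb from word to its ancestor on the given layer.

    Returns None when the word already lies above the selected layer.
    """
    d = depth(word)
    if d < layer:
        return None
    while d > layer:
        word = THESAURUS_PARENT[word]
        d -= 1
    return word


def integrate_clusters(clusters, layer):
    """Merge clusters whose key words share a registered word on the layer.

    The shared registered word becomes the classification label.
    """
    merged = defaultdict(list)
    for key_word, docs in clusters.items():
        label = ancestor_on_layer(key_word, layer)
        if label is not None:
            merged[label].extend(docs)
    return dict(merged)


clusters = {"apple": ["d1", "d2"], "banana": ["d3"], "carrot": ["d4"]}
labeled = integrate_clusters(clusters, layer=1)
```

With `layer=1`, the clusters keyed by "apple" and "banana" are integrated under the label "fruit", so every resulting cluster is labeled at the same granularity. Selecting a deeper layer leaves more, narrower clusters.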
However, in the technique that classifies and integrates the document clusters by using the registered words on the same layer of the thesaurus, the registered words in the thesaurus are widely distributed, and therefore the number of document clusters increases. Moreover, the classification label is a narrow-sense word belonging to a lower-level concept in the thesaurus. It is therefore difficult to present the document classification result intelligibly.
Embodiments described herein are directed to providing a document classification apparatus, method, and program that can present a document analysis result intelligibly.